Salesforce Data Cloud Ingestion from SharePoint
Application details
Technical considerations
- This solution is designed for SharePoint Online (not SharePoint Server)
- A client application must be registered in Microsoft Entra ID to use this application
- The Mule application uses Microsoft Graph APIs to collect information and does not use the legacy SharePoint REST Services
- One instance of the Mule application is deployed per SharePoint site and will monitor all Document Libraries in the site for changes
- SharePoint content is delivered in the preferred MIME type when possible; otherwise it is delivered as PDF, HTML, as-is, or optionally as Base64-encoded text (in that order)
- The Mule application does not support subscriptions for change notifications
- The /ping endpoint makes an authenticated request to SharePoint by attempting to obtain a list of document libraries
- The Mule application is designed to be stateless except for the full refresh scenario, which requires some state to be maintained to sequentially process multiple document libraries
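The delivery order described in the list above can be read as a simple fallback chain. The following Python sketch is one interpretation, not the Mule application's code: the function name `choose_delivery`, its parameters, and the `convertible_to` collection (the set of MIME types a conversion is available for) are all assumptions made for illustration.

```python
from typing import Iterable, Optional

# Fallback targets tried after the preferred MIME type, in the documented order.
FALLBACK_TYPES = ("application/pdf", "text/html")


def choose_delivery(preferred: Optional[str],
                    convertible_to: Iterable[str],
                    encode_binary: bool = False) -> str:
    """Return the MIME type to deliver, or "as-is" / "base64" when no conversion applies.

    Hypothetical helper: `convertible_to` lists the MIME types the file can be
    delivered in; `encode_binary` mirrors the optional Base64 behavior.
    """
    available = set(convertible_to)
    candidates = ([preferred] if preferred else []) + list(FALLBACK_TYPES)
    for mime in candidates:
        if mime in available:
            return mime
    # No conversion is possible: pass the file through, or Base64-encode it on request.
    return "base64" if encode_binary else "as-is"
```

For example, a request that prefers `text/plain` for a file that can only be converted to PDF or HTML would resolve to `application/pdf` under this reading.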
Activity diagrams
The following activity diagrams illustrate the sequence of processing to ingest the unstructured metadata and its content on demand.
Initial Load/Full Refresh Synchronous
Initial Load/Full Refresh Asynchronous
Incremental Load
Get Content
Processing logic
The primary handling and orchestration of unstructured metadata ingestion is implemented in the Salesforce Data Cloud Ingestion from SharePoint Process API. This process is described in more detail in the following sections.
Initial Load/Full Refresh Synchronous
- A user action from Data Cloud initiates the request for a full refresh of the site metadata
- Data Cloud invokes the Mule application without a continuation token to start the process
- The Mule application receives the request and will:
  - Enumerate all libraries in the configured site (only on calls that do not include a continuation token)
  - Create an object store entry with a list of all libraries (only on calls that do not include a continuation token)
  - Filter out libraries with no content (only on calls that do not include a continuation token)
  - Create an artificial continuation token to return to Data Cloud (virtual token) (only on calls that do not include a continuation token)
  - Retrieve the site metadata from SharePoint Online
  - Transform the results into the Data Cloud required format
  - Maintain the state of which libraries have been completely fetched and which still remain
  - Maintain the state of the SharePoint continuation token, which is issued by SharePoint for each library
- Data Cloud invokes the Mule application in a loop to handle pagination and retrieve metadata until all the metadata content has been retrieved by using the continuation token provided in a previous response
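The interplay between the virtual token and the per-library SharePoint token can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Mule flow itself: the class `FullRefresh`, the callables `list_libraries` and `fetch_page`, the `itemCount` field, the response keys, and the `to_data_cloud` transform are all hypothetical names, and an in-memory dictionary stands in for the object store.

```python
import uuid
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical callables standing in for the SharePoint calls.
ListLibraries = Callable[[], List[dict]]
# fetch_page(library_id, sharepoint_token) -> (items, next_sharepoint_token or None)
FetchPage = Callable[[str, Optional[str]], Tuple[List[dict], Optional[str]]]


def to_data_cloud(item: dict) -> dict:
    """Placeholder transform; the real Data Cloud format is not shown in the text."""
    return {"id": item["id"], "name": item.get("name")}


class FullRefresh:
    def __init__(self, list_libraries: ListLibraries, fetch_page: FetchPage):
        self.list_libraries = list_libraries
        self.fetch_page = fetch_page
        self.store: Dict[str, dict] = {}  # stand-in for the object store

    def handle(self, token: Optional[str] = None) -> dict:
        if token is None:
            # First call: enumerate libraries, drop empty ones, issue a virtual token.
            pending = [lib["id"] for lib in self.list_libraries() if lib["itemCount"] > 0]
            token = uuid.uuid4().hex
            self.store[token] = {"pending": pending, "sp_token": None}
        state = self.store[token]
        items: List[dict] = []
        if state["pending"]:
            items, next_sp = self.fetch_page(state["pending"][0], state["sp_token"])
            if next_sp is None:
                state["pending"].pop(0)  # current library is fully fetched
            state["sp_token"] = next_sp
        if not state["pending"]:
            del self.store[token]  # nothing left, so end the pagination loop
            token = None
        return {"items": [to_data_cloud(i) for i in items], "continuationToken": token}
```

A caller playing the role of Data Cloud would invoke `handle()` once without a token and then keep passing back the returned `continuationToken` until it is `None`.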
Initial Load/Full Refresh Asynchronous
- The Mule application receives a request to perform an asynchronous refresh of all metadata and will:
  - Enumerate all libraries in the configured site (only if the object store is empty - initialization)
  - Create an object store entry with a list of all libraries (only if the object store is empty - initialization)
  - Filter out libraries with no content (only if the object store is empty - initialization)
  - Retrieve the site metadata from SharePoint Online
  - Transform the results into the required format for the Data Cloud Ingestion API
  - Send the transformed data to the ingestion endpoint
  - Maintain the state of which libraries have been completely fetched and which still remain
  - Maintain the state of the SharePoint continuation token, which is issued by SharePoint for each library
  - Loop through all libraries, following the steps above
- If an existing asynchronous operation is running or no libraries with content are found, a 429 (Conflict) HTTP status is returned
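The rejection rule for overlapping asynchronous refreshes can be illustrated with a small guard. This Python sketch is an assumption-laden illustration rather than the application's implementation: the class `AsyncRefresh`, the callables `list_libraries` and `refresh_library`, the `itemCount` field, and the 202 status for an accepted request are hypothetical; only the rejection status is taken from the text above.

```python
import threading
from typing import Callable, List, Optional

ACCEPTED_STATUS = 202  # assumption: the text does not say what a successful start returns
REJECTED_STATUS = 429  # value taken from the text above


class AsyncRefresh:
    def __init__(self,
                 list_libraries: Callable[[], List[dict]],
                 refresh_library: Callable[[str], None]):
        self.list_libraries = list_libraries
        self.refresh_library = refresh_library
        self._busy = threading.Lock()  # held while a refresh is running
        self.worker: Optional[threading.Thread] = None

    def start(self) -> int:
        # Reject immediately if another refresh already holds the lock.
        if not self._busy.acquire(blocking=False):
            return REJECTED_STATUS
        try:
            pending = [lib["id"] for lib in self.list_libraries() if lib["itemCount"] > 0]
        except Exception:
            self._busy.release()
            raise
        if not pending:
            # No libraries with content: nothing to refresh.
            self._busy.release()
            return REJECTED_STATUS
        self.worker = threading.Thread(target=self._run, args=(pending,))
        self.worker.start()
        return ACCEPTED_STATUS

    def _run(self, pending: List[str]) -> None:
        try:
            for library_id in pending:
                self.refresh_library(library_id)  # fetch, transform, and send per library
        finally:
            self._busy.release()  # allow the next refresh request
```

The lock is released in a `finally` block so that a failed refresh does not leave later requests permanently rejected.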
Get Content
- Data Cloud initiates the request to retrieve the content
- Mule application receives the request to retrieve and stream the content from SharePoint Online
- Mule application will attempt to transcode the file to the preferred MIME type requested by Data Cloud, provided the Microsoft Graph service supports the conversion
Important notes:
- A resource identifier for content retrieval is a concatenation of the drive (library) identifier, a comma separator, and the internal resource identifier (for example, b!yit6plLgAkK3fK_nKKrZd7jE-m_vJGdOgFMp-7pHxbuaIfzv_0USQI1QqN5WM8NB,01FTGOZOM67OM7V6PF2JDK7VCQHM7DV6YZ).
- Requesting binary content with the encodeBinaryContent flag set to true will disable streaming due to the nature of the Base64 encoding operation. This may result in request timeouts when attempting to encode very large files.
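To make the resource identifier format concrete, the following Python sketch splits an identifier into its two parts and builds a Microsoft Graph download URL from them. The helper names `parse_resource_id` and `content_url` are hypothetical, and the sketch assumes the application maps the identifier onto the Graph `/drives/{drive-id}/items/{item-id}/content` endpoint with its optional `format` query parameter; the text does not state which Graph calls are used.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote, urlencode

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


class ResourceId(NamedTuple):
    drive_id: str  # document library (drive) identifier
    item_id: str   # internal resource identifier within that drive


def parse_resource_id(resource_id: str) -> ResourceId:
    """Split '<drive id>,<item id>' on the comma separator."""
    drive_id, sep, item_id = resource_id.partition(",")
    if not sep or not drive_id or not item_id:
        raise ValueError(f"expected '<drive id>,<item id>', got {resource_id!r}")
    return ResourceId(drive_id, item_id)


def content_url(resource_id: str, fmt: Optional[str] = None) -> str:
    """Build a download URL; `fmt` (for example 'pdf') requests a converted copy."""
    rid = parse_resource_id(resource_id)
    url = (f"{GRAPH_BASE}/drives/{quote(rid.drive_id, safe='!')}"
           f"/items/{quote(rid.item_id, safe='')}/content")
    if fmt:
        url += "?" + urlencode({"format": fmt})
    return url
```

Called with the example identifier above and `fmt="pdf"`, `content_url` yields a URL ending in `/items/01FTGOZOM67OM7V6PF2JDK7VCQHM7DV6YZ/content?format=pdf`.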
Incremental Load
- Mule application runs a scheduler at a given frequency
- If an entry does not exist in the object store, the Mule application will:
  - Enumerate all libraries in the configured site (only if the object store is empty - initialization)
  - For each library, call Delta Query with the token=latest parameter to obtain the current/latest token
  - Create an object store entry with a list of all libraries and the latest Delta token (only if the object store is empty - initialization)
  - End the process; the next scheduled execution will locate changes
- If an entry exists in the object store, the Mule application will:
  - Fetch the object store state for all libraries, including the Delta token
  - For each library, call the Delta Query with token=value from the object store to obtain the changes since the last execution
  - Publish the metadata to the ingestion API
  - Update the object store per library with the most recent Delta token
- If Delta Query has paginated results, the Mule application will follow the "nextLink" until there are no more pages, publishing each page of results to the Data Cloud Ingestion API
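The two branches of the scheduled run can be sketched in Python as shown here. This is an illustration under assumptions, not the Mule implementation: `run_incremental`, `list_libraries`, `get_json`, and `publish` are hypothetical names, and a plain dictionary stands in for the object store. One deliberate simplification: instead of extracting the Delta token, the sketch stores the full `@odata.deltaLink` URL returned by Microsoft Graph, which embeds that token, so it makes no assumption about the URL layout.

```python
from typing import Callable, Dict, List

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def latest_delta_url(drive_id: str) -> str:
    """Delta Query URL that returns no items, only the current delta link."""
    return f"{GRAPH_BASE}/drives/{drive_id}/root/delta?token=latest"


def run_incremental(store: Dict[str, str],
                    list_libraries: Callable[[], List[str]],
                    get_json: Callable[[str], dict],
                    publish: Callable[[List[dict]], None]) -> None:
    """One scheduled execution. `store` maps drive id -> last delta link."""
    if not store:
        # Initialization: record the current position for every library, then stop.
        for drive_id in list_libraries():
            page = get_json(latest_delta_url(drive_id))
            store[drive_id] = page["@odata.deltaLink"]
        return
    for drive_id, delta_link in list(store.items()):
        url = delta_link
        while url:
            page = get_json(url)
            if page.get("value"):
                publish(page["value"])  # send each page of changes as it arrives
            url = page.get("@odata.nextLink")  # absent on the final page
        # The final page carries the delta link to use on the next run.
        store[drive_id] = page["@odata.deltaLink"]
```

Passing `get_json` and `publish` in as parameters keeps the sketch independent of any particular HTTP client or ingestion endpoint.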
Success conditions
Upon successful completion, the following conditions will be met:
- All metadata associated with unstructured content in the document libraries in SharePoint Online is retrieved and processed.
- The full load of metadata is retrieved on demand.
- An incremental load of metadata is uploaded to Data Cloud on a scheduled frequency.
- Retrieval of content in PDF and HTML is supported.